
Document node allocatable enhancements #4532

Merged
merged 1 commit into from Jun 19, 2017

Conversation

derekwaynecarr
Member

this is a major feature in OCP 3.6 for node reliability.

any improvements to readability and/or clarity are appreciated.

/cc @mburke5678 @sjenning @eparis

@derekwaynecarr
Member Author

/cc @mffiedler @jeremyeder @sdodson


@sjenning sjenning left a comment


mostly nits. nothing too important. just trying to make a confusing thing less confusing with uniform and precise wording.

@@ -73,9 +73,18 @@ introduction of allocatable resources.
An allocated amount of a resource is computed based on the following formula:

----
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved]
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - Hard-Eviction-Thresholds]

nit: missing [

----
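
Worked through with the reservation values used in the example configuration later in this thread, and assuming (for illustration only) a node with 32Gi of memory capacity, the formula gives:

    [Allocatable] = 32Gi - 2Gi (kube-reserved) - 1Gi (system-reserved) - 100Mi (hard-eviction threshold)
                  = 28.9Gi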

[NOTE]
====
The reduction of `Hard-Eviction-Thresholds` from allocatable is a change in behavior to improve

s/reduction/withholding/

[[node-enforcement]]
== Node enforcement

The node is able to enforce the total amout of compute resources that pods

typo: amount


s/enforce/limit/


The node is able to enforce the total amout of compute resources that pods
may consume based on the configured allocatable value. This feature significantly
improves the reliability of the node by preventing end-user pods from starving

"end-user" as opposed to.. what? if pods are pods, i'd delete "end-user".

Member Author

in the future, we may let special users add pods to the "system" and "kube-reserved" cgroups. in that context, "end-user" meant not those, but in this context, i agree it's confusing.

may consume based on the configured allocatable value. This feature significantly
improves the reliability of the node by preventing end-user pods from starving
system services (i.e. container runtime, node agent, etc.) of required compute
resources. It is strongly encouraged that administrators set aside system

s/of required compute resources/for resources/

s/set aside/reserve/

Contributor

Avoid i.e. and etc.
"(for example: container runtime, node agent, and so on)"

been reclaimed. To avoid (or reduce the probability of) system OOMs the node
provides xref:../admin_guide/out_of_resource_handling.adoc[Out Of Resource Handling].

By reserving some memory via the `*--eviction-hard*` flag, the node attempts to evict

There is a mixing of "setting", "flag", and "argument" terminology. We might want to unify this.

Contributor

s/*--eviction-hard*/--eviction-hard
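
For reference, this flag is set through `kubeletArguments` in the node configuration, as in the example configuration further down in this thread (the 100Mi value here is purely illustrative):

kubeletArguments:
  eviction-hard:
    - "memory.available<100Mi"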

and system components use up all their reservation, the memory available for pods is `28.9Gi`,
and kubelet will evict pods when it exceeds this usage.

If we enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceeds `28.9Gi`

typo: exceed

If we enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceeds `28.9Gi`
in which case evictions would not be performed unless kernel memory consumption is above `100Mi`.

In order to support evictions and avoid memcg OOM kills for pods, the node sets the top level cgroup

s/memcg OOM kills for pods/pods being OOM killed inside the kubepods cgroup/

explaining the kubepods cgroup could be worth it so we don't have to use different, confusing language each time we are referring to it. "pod-level cgroup" already means something else so...

Member Author

i want to avoid describing the cgroup hierarchy too much here to be honest.

is the same for user pods.

With the above example, node will evict pods whenever pods consume more than `28.9Gi` which will be
`<100Mi` from `29Gi` which will be the memory limits on the Node Allocatable cgroup.

"Node Allocatable cgroup" == kubepods? I find this sentence confusing.

and kubelet will evict pods when it exceeds this usage.

If we enforce node allocatable (`28.9Gi`) via top level cgroups, then pods can never exceeds `28.9Gi`
in which case evictions would not be performed unless kernel memory consumption is above `100Mi`.

So... I think this is made with the assumption that we are using all the node-allocatable enforcement groups: pods, system-reserved, and kube-reserved.

Member Author

clarified.

- "pods" <3>
----
<1> Enable or disable the new cgroup hierarchy managed by the node. Any change
of this setting requires a full drain of the node. It is strongly encouraged
Contributor

s/It is strongly encouraged that users not change this value./We recommend that users do not change this value./

enforce node allocatable.
<2> The cgroup driver used by the node when managing cgroup hierarchies. This
value must match the driver associated with the container runtime. Valid values
are "systemd" and "cgroupfs". The default is "systemd".
Contributor

s/"systemd"/systemd
s/"cgroupfs"/cgroupfs

corresponding --kube-reserved-cgroup or --system-reserved-cgroup needs to be provided.
In future releases, the node and container runtime will be packaged in a common cgroup
separate from `system.slice`. Until that time, it is not encouraged for users to
change the default value of enforce-node-allocatable flag.
Contributor

s/it is not encouraged for users to change/we recommend not changing
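
Taken together, the settings covered in these callouts correspond to the following node configuration stanza (matching the example configuration later in this thread):

kubeletArguments:
  cgroups-per-qos:
    - "true"
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods"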

separate from `system.slice`. Until that time, it is not encouraged for users to
change the default value of enforce-node-allocatable flag.

System daemons are expected to be treated similar to Guaranteed pods. System daemons
Contributor

Who is expecting them to be treated similar?
"{product-title} expects system daemons to be treated similar..." ??
"System daemons should be treated similar..." ??

exhaustively to come up with precise estimates and are confident in their ability to
recover if any process in that group is oom_killed.

As a result, it is strongly recommended that users only enforce node allocatable for
Contributor

s/it is strongly recommended/we strongly recommend


By reserving some memory via the `*--eviction-hard*` flag, the node attempts to evict
pods whenever memory availability on the node drops below the reserved value.
Hypothetically, if system daemons did not exist on a node, pods cannot use more than
Contributor

"Hypothetically"?? Do we not know for certain?

s/did not/do not

In order to support evictions and avoid memcg OOM kills for pods, the node sets the top level cgroup
limits for all pods to be `Node Allocatable + Eviction Hard Thresholds`.

However, the scheduler is not expected to scheduler more than `28.9Gi` and so `Node Allocatable` on node
Contributor

s/to scheduler/to schedule

@mburke5678
Contributor

@derekwaynecarr Some style comments.

@derekwaynecarr
Copy link
Member Author

@sjenning @mburke5678 -- made some updates, ptal

Copy link

@sjenning sjenning left a comment


just a few more. then lgtm.

[[node-enforcement]]
== Node enforcement

The node is able to limit the total amount of compute resources that pods

nit: remove "compute"


Administrators should treat system daemons similar to Guaranteed pods. System daemons
can burst within their bounding control groups and this behavior needs to be managed
as part of cluster deployments. Enforcing system-reserved reservations

s/reservations/limits/

provides xref:../admin_guide/out_of_resource_handling.adoc[Out Of Resource Handling].

By reserving some memory via the `--eviction-hard` flag, the node attempts to evict
pods whenever memory availability on the node drops below the reserved value.

s/reserved value/certain value or percentage/

By reserving some memory via the `--eviction-hard` flag, the node attempts to evict
pods whenever memory availability on the node drops below the reserved value.
If system daemons did not exist on a node, pods are limited to the memory
`capacity - eviction-hard`. For this reason, resources reserved for evictions are not

s/reserved for evictions/set aside as a buffer for eviction before reaching out of memory conditions/
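
As a rough illustration of `capacity - eviction-hard`, assuming (hypothetically) 32Gi of node capacity and a 100Mi hard-eviction threshold, pods on a node with no system daemons would be limited to:

    32Gi - 100Mi ≈ 31.9Gi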

@vikram-redhat
Contributor


As a result, we strongly recommended that users only enforce node allocatable for
`pods` by default, and set aside appropriate reservations for system daemons to maintain
overall node reliability.
@qwang1 qwang1 Jun 13, 2017

Does this mean it's recommended like this?

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "cpu=200m,memory=30G"
  system-reserved:
    - "cpu=200m,memory=30G"
  eviction-hard:
    - "memory.available<1Gi"

Member Author

@qwang1 - for the described scenario, this would be the example configuration.

we explicitly avoid defining a recommendation for the reservation at this time as it's a factor of pod density.

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "memory=2GI"
  system-reserved:
    - "cpu=1,memory=1Gi"
  eviction-hard:
    - "memory.available<100Mi"

@derekwaynecarr
Member Author

@sjenning - updated.

@mburke5678
Contributor

mburke5678 commented Jun 19, 2017

@derekwaynecarr Is this ready for merge?
Is the associated Trello card, "Need guarantees on OS cgroup reservation," related to "Enforce QoS level memory limits," as they both discuss cgroups?

@derekwaynecarr
Member Author

this is ready to merge. this doc should cover https://trello.com/c/QPRTj0uW/524-document-node-need-guarantees-on-os-cgroup-reservation-ops-rfe. @sjenning should write a follow-up to document the alpha support for https://trello.com/c/wq1qcjDa/455-document-node-enforce-qos-level-memory-limits, which is a new flag, --experimental-qos-reserved.

@mburke5678 mburke5678 merged commit fc53aa1 into openshift:master Jun 19, 2017
@mburke5678 mburke5678 added this to the Future Release milestone Jun 19, 2017
@vikram-redhat vikram-redhat modified the milestones: Future Release, Staging Jul 7, 2017
@vikram-redhat vikram-redhat removed this from the Future Release milestone Aug 9, 2017
@vikram-redhat vikram-redhat modified the milestones: Staging, Future Release, OCP 3.6 GA Aug 9, 2017